7 Applied example

Please note that the data set used in this section is fairly large (~70,000 rows) and so the figures below might take some time to load.

This section demonstrates the workflow and features of plotscaper using a large real-world dataset focused on a pressing global issue: mental health.

Mental health is a global concern. In developed nations, mental health disorders are primary contributor to years lived with disability, significantly impacting both the well-being of individuals and the economic productivity of entire countries (Organization 2022). This issue, however, extends beyond developed nations. The global burden of mental health disorders has been steadily rising over the recent decades, a trend which is particularly concerning in low-income countries where access to mental health services is even more limited (Patel et al. 2018).

Having had personal experience with friends and relatives impacted by mental health issues, as well as having majored in psychology during my undergraduate studies, I have had a long-standing interest in the topic. It is clear to me that mental health is a key challenge that the world is facing today, and the first step towards solving it will require clearly identifying key trends and patterns. For this reason, I chose to dedicate this section to an exploration of a large longitudinal mental health data set.

7.0.1 About the data set

The Institute of Health Information and Statistics of the Czech Republic (IHIS, ÚZIS in Czech) is a government agency established by the Czech Ministry of Health. Its primary roles is to collect, process, and report on medical data within the country of Czechia (ÚZIS 2024). Of interest, the institute provides high-quality, open-access medical data, including information about the use and manufacture of medicines, summaries of fiscal and employment records in medical facilities, and various epidemiological data sets.

The Long-term care data set (Soukupová et al. 2023) contains longitudinal information about long-term psychiatric care in Czechia. More specifically, it contains aggregated data on individuals released from psychiatric care facilities between 2010 and 2022. It includes information such the region of the treatment facility, the sex of the patients, age category, diagnosis based on the international ICD-10 classification (World Health Organization 2024a, 2024b), the number of hospitalizations, and the total number of days spent in care by the given subset of patients.

Here’s the data set at a quick glance:

df <- read.csv("./data/longterm_care.csv")
dplyr::glimpse(df)
## Rows: 68,115
## Columns: 12
## $ year                   <int> 2019, 2016, 2011, 2013, 2019, 2013, 2018, 2017, 2019, 2015, 20…
## $ region_code            <chr> "CZ071", "CZ064", "CZ080", "CZ072", "CZ080", "CZ053", "CZ080",…
## $ region                 <chr> "Olomoucký kraj", "Jihomoravský kraj", "Moravskoslezský kraj",…
## $ sex                    <chr> "female", "male", "male", "female", "male", "female", "male", …
## $ diagnosis              <chr> "f10", "f2", "f7", "f4 without f42", "f60–f61", "f32–f33", "f3…
## $ reason_for_termination <chr> "release", "early termination", "early termination", "early te…
## $ age_category           <chr> "40–49", "30–39", "30–39", "40–49", "50–59", "40–49", "60–69",…
## $ stay_category          <chr> "short-term", "medium-term", "short-term", "short-term", "shor…
## $ field                  <chr> "psychiatry", "psychiatry", "psychiatry", "psychiatry", "psych…
## $ care_category          <chr> "adult", "adult", "adult", "adult", "adult", "adult", "adult",…
## $ cases                  <int> 13, 3, 2, 3, 1, 2, 2, 32, 2, 2, 1, 12, 1, 1, 2, 9, 18, 1, 1, 1…
## $ days                   <int> 196, 345, 38, 108, 1, 47, 319, 3813, 120, 256, 151, 574, 49, 2…

The data contains over 68,000 rows, totaling over 410,000 hospitalizations. Each row records the number patients with a particular set of of characteristics released from a treatment facility during a given year, and the number of days the patients spent in treatment in total.

The original dataset used Czech column names and category labels. To make the analysis more easily accessible to non-Czech speakers, I took the liberty of translating most of these to English (excluding the region variable). The translation script is available in the thesis repository, at the following path: ./data/longterm_care_translate.R. Additionally, the data set website contains a JSON schema with a text description of each of the variables (Soukupová et al. 2023). I took the liberty of translating these descriptions as well, and provide them below, in table Table 7.1:

Table 7.1: Schema of the long-term care data set, including the original column names (Czech), as well as translated names and descriptions.
Translated name Original name Description
year rok The year hospitalization was terminated
region_code kraj_kod Region code based on the NUTS 3 classification
region kraj_nazev Region where the facility was located
sex pohlavi Classification of patients’ sex
diagnosis zakladni_diagnoza The primary diagnosis of the psychiatric disorder based on the ICD-10 classification
reason_for_termination ukonceni The reason for termination of care
age_category vekova_kategorie Classification of patients’ age category
stay_category kategorie_delky_hospitalizace Classification of hospitalization based on length: short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months)
field obor The field of psychiatric care
care_category kategorie_pece Classification of care: child or adult
cases pocet_hospitalizaci The total number of cases/hospitalizations in the given subgroup of patients
days delka_hospitlizace The total time spent in care, in days (= sum of the number of days all patients in a given subgroup spent in care)

7.0.2 Interactive exploration

Let’s jump into exploring the data.

7.0.2.1 The relationship between cases and days

The first to note about the data set is that the data has been aggregated, such that each row represents the combined number of releases within a given subset of patients. For example, the first row indicates that, in the year 2019, 13 women aged 40-49 were released from treatment facilities in Olomoucký kraj (region), after receiving short-term care for F10 ICD-10 diagnosis (mental and behavioural disorders due to alcohol use, World Health Organization 2024a) for a sum total of 196 days:

df[1, ]
##   year region_code         region    sex diagnosis reason_for_termination age_category
## 1 2019       CZ071 Olomoucký kraj female       f10                release        40–49
##   stay_category      field care_category cases days
## 1    short-term psychiatry         adult    13  196

Thus, the two primary continuous variables of interest are cases (the number of patients in a given subgroup released from care) and days (the total number of days the given subgroup of patients spent in care). Intuitively, we should expect a fairly linear relationship between these variables, such that a larger group of patients should spend a greater number of days in care, in total. We can use plotscaper to visualize this relationship via a scatterplot:

library(plotscaper) # Load in the package

create_schema(df) |> # Create a schema that can be modified declaratively
  add_scatterplot(c("cases", "days")) |>
  render() # Render the figure

Interestingly, the data does not exhibit a simple linear relationship. Instead, there seems to be a leaf-shaped pattern, with three distinct, roughly linear “leaflets”, suggesting a pooling effect. Closer inspection of the data reveals that the stay_category variable has three levels, corresponding to short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) care. Color-coding the cases by these categories indeed confirms that these correspond to the three observed leaflets:

df |>
  create_schema() |>
  add_scatterplot(c("cases", "days")) |>
  add_barplot(c("stay_category", "cases")) |> # y-axis isn't just count but weighted by cases
  assign_cases(which(df$stay_category == "short-term"), 1) |> # Mark short-term cases green
  assign_cases(which(df$stay_category == "long-term"), 2) |> # Mark long-term cases red
  render()

Try clicking on the bars in the barplot to confirm there is a fairly minimal overlap between the data points in the three categories.

However, the pooling effect does not fully explain the absence of points between the three leaflets. If the distribution of cases and days within each of the three stay_category levels were uniform, we should expect to see more points within the gaps. This suggests a potential selection process, where patients are less likely to be discharged at durations near the category boundaries. We can confirm this by plotting the average number of days a group of patients spent in care:

df$avg_days <- df$days / df$cases

df |>
  create_schema() |>
  add_scatterplot(c("cases", "avg_days")) |>
  add_barplot(c("stay_category", "cases")) |>
  assign_cases(which(df$stay_category == "short-term"), 1) |>
  assign_cases(which(df$stay_category == "long-term"), 2) |>
  set_scale("scatterplot1", "y", transformation = "log10", default = TRUE) |>
  render()

Now we can clearly see the gaps between the three different distributions along the y-axis.

Try querying the points near the gaps by pressing the Q key and hovering over them. You will observe that these gaps roughly span the 60-90 day and 150-210 day ranges, corresponding to 2-3 months and 5-7 months, respectively. This is a strong indication of a selection process is at work. Specifically, it seems that patients staying for more than two months are likely to be transitioned to medium-term care and kept around for longer, while those staying for more than five months are similarly likely to be moved to long-term care. There are likely mundane administrative or health-care provider reasons for this, however, it is still an interesting pattern.

7.0.2.2 Number of cases over time

An important question to answer is how has the number of patients and the number of days they spent in care changed over time. We can investigate this by plotting the same scatterplot as we did in the section above, as well as two barplots showing the total number of cases and days within each year. We can also include a histogram of the number of days, for good measure:

schema <- df |>
  create_schema() |>
  add_scatterplot(c("cases", "days")) |>
  add_barplot(c("year", "cases")) |>
  add_barplot(c("year", "days")) |>
  add_histogram(c("days")) |>
  set_parameters("histogram1", width = 20) # Set histogram binwidth to 20 days

schema |> render()

From the barplots, we can immediately see an interesting pattern: while the numbers of cases seem to have declined over time, the number of days patients spend in care seem to seem to have stayed fairly constant. This suggest that while fewer patients are being hospitalized, they are spending longer time in care.

We can indeed confirm this interactively. Try clicking on the bars corresponding to the year 2010 and 2022, in either of the two barplots (feel free to mark either of the bars in red or green by holding the 1 or 2 keys and clicking the bars). It is clearly visible that, compared with 2010, in 2022 there were more patients in long term care, and the relationship between the number of cases and the total number of days was steeper.

While the declining number of cases over time might initially appear positive, the constant number days in treatment suggests a more concerning trend. Specifically, given that treatment facility placements are limited, the increasing number of patients staying in care for longer durations may be burdening the healthcare system, reducing its capacity to serve new patients.

We can also scrutinize the trend more closely by zooming into the histogram:

schema |> 
  assign_cases(which(df$year == 2010), 1) |>
  assign_cases(which(df$year == 2022), 2) |>
  zoom("histogram1", c(-100, 0, 1500, 2000), units = "data") |>
  render()

References

Organization, World Health. 2022. World Mental Health Report: Transforming Mental Health for All. World Health Organization.
Patel, Vikram, Shekhar Saxena, Crick Lund, Graham Thornicroft, Florence Baingana, Paul Bolton, Dan Chisholm, et al. 2018. “The Lancet Commission on Global Mental Health and Sustainable Development.” The Lancet 392 (10157): 1553–98.
Soukupová, J, H Melicharová, O Šanca, V Bartůněk, J Jarkovský, and M Komenda. 2023. Dlouhodobá psychiatrická péče.” NZIP. https://www.nzip.cz/data/2060-dlouhodoba-psychiatricka-pece.
ÚZIS. 2024. About us - ÚZIS ČR.” https://www.uzis.cz/index-en.php?pg=about-us.
World Health Organization. 2024a. ICD-10 Version:2019.” https://icd.who.int/browse10/2019/en.
———. 2024b. International Classification of Diseases (ICD).” https://www.who.int/standards/classifications/classification-of-diseases.